In survival analysis we analyse time-to-event data and try to estimate the underlying distribution of times for events to occur.
For the purposes of this workshop, we focus on a single event type, but the topic is wide and broadly applicable.
All materials for this workshop are available in my standard GitHub repo:
https://github.com/kaybenleroll/dublin_r_workshops
The content of this workshop is partly based on the book “Applied Survival Analysis Using R” by Dirk F. Moore. The data from this book is available from CRAN via the asaur package, and there is also a GitHub repo containing the code from the book.
This workshop will use a number of different time-to-event datasets, so we look at those first.
The telco churn data contains some customer usage information on mobile phone customers and whether or not they left their subscription accounts.
telco_churn_tbl %>% glimpse()
## Observations: 5,000
## Variables: 21
## $ state <chr> "KS", "OH", "NJ", "OH", "OK", "AL", "MA", "MO", "L...
## $ accountdur <int> 128, 107, 137, 84, 75, 118, 121, 147, 117, 141, 65...
## $ areacode <chr> "415", "415", "415", "408", "415", "510", "510", "...
## $ phonenumber <chr> "382-4657", "371-7191", "358-1921", "375-9999", "3...
## $ intlplan <chr> "no", "no", "no", "yes", "yes", "yes", "no", "yes"...
## $ vmailplan <chr> "yes", "yes", "no", "no", "no", "no", "yes", "no",...
## $ vmailmessage <int> 25, 26, 0, 0, 0, 0, 24, 0, 0, 37, 0, 0, 0, 0, 0, 0...
## $ daymins <dbl> 265.1, 161.6, 243.4, 299.4, 166.7, 223.4, 218.2, 1...
## $ daycalls <int> 110, 123, 114, 71, 113, 98, 88, 79, 97, 84, 137, 1...
## $ daycharge <dbl> 45.07, 27.47, 41.38, 50.90, 28.34, 37.98, 37.09, 2...
## $ evemins <dbl> 197.4, 195.5, 121.2, 61.9, 148.3, 220.6, 348.5, 10...
## $ evecalls <int> 99, 103, 110, 88, 122, 101, 108, 94, 80, 111, 83, ...
## $ evecharge <dbl> 16.78, 16.62, 10.30, 5.26, 12.61, 18.75, 29.62, 8....
## $ nightmins <dbl> 244.7, 254.4, 162.6, 196.9, 186.9, 203.9, 212.6, 2...
## $ nightcalls <int> 91, 103, 104, 89, 121, 118, 118, 96, 90, 97, 111, ...
## $ nightcharge <dbl> 11.01, 11.45, 7.32, 8.86, 8.41, 9.18, 9.57, 9.53, ...
## $ intlmins <dbl> 10.0, 13.7, 12.2, 6.6, 10.1, 6.3, 7.5, 7.1, 8.7, 1...
## $ intlcalls <int> 3, 3, 5, 7, 3, 6, 7, 6, 4, 5, 6, 5, 2, 5, 6, 9, 4,...
## $ intlcharge <dbl> 2.70, 3.70, 3.29, 1.78, 2.73, 1.70, 2.03, 1.92, 2....
## $ custservicecalls <int> 1, 1, 0, 2, 3, 0, 3, 0, 1, 0, 4, 0, 1, 3, 4, 4, 1,...
## $ churned <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...
telco_churn_tbl %>% head()
## # A tibble: 6 x 21
## state accountdur areacode phonenumber intlplan vmailplan vmailmessage daymins
## <chr> <int> <chr> <chr> <chr> <chr> <int> <dbl>
## 1 KS 128 415 382-4657 no yes 25 265.
## 2 OH 107 415 371-7191 no yes 26 162.
## 3 NJ 137 415 358-1921 no no 0 243.
## 4 OH 84 408 375-9999 yes no 0 299.
## 5 OK 75 415 330-6626 yes no 0 167.
## 6 AL 118 510 391-8027 yes no 0 223.
## # ... with 13 more variables: daycalls <int>, daycharge <dbl>, evemins <dbl>,
## # evecalls <int>, evecharge <dbl>, nightmins <dbl>, nightcalls <int>,
## # nightcharge <dbl>, intlmins <dbl>, intlcalls <int>, intlcharge <dbl>,
## # custservicecalls <int>, churned <lgl>
The prostate survival data is simulated data modelled on a cancer survival study.
prostate_surv_tbl %>% glimpse()
## Observations: 14,294
## Variables: 5
## $ grade <fct> mode, mode, poor, mode, mode, poor, poor, mode, mode, poor...
## $ stage <fct> T1c, T1ab, T1c, T2, T1c, T2, T1c, T1ab, T1c, T2, T1c, T1c,...
## $ ageGroup <fct> 80+, 75-79, 75-79, 70-74, 70-74, 75-79, 80+, 80+, 75-79, 7...
## $ survTime <int> 18, 23, 37, 27, 42, 38, 18, 78, 47, 13, 81, 23, 21, 13, 11...
## $ status <int> 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
prostate_surv_tbl %>% head()
## # A tibble: 6 x 5
## grade stage ageGroup survTime status
## <fct> <fct> <fct> <int> <int>
## 1 mode T1c 80+ 18 0
## 2 mode T1ab 75-79 23 0
## 3 poor T1c 75-79 37 0
## 4 mode T2 70-74 27 0
## 5 mode T1c 70-74 42 0
## 6 poor T2 75-79 38 2
The smoker data is a smoking relapse study involving a number of different treatments and details on when and if the patient resumed smoking.
pharmaco_smoker_tbl %>% glimpse()
## Observations: 125
## Variables: 14
## $ id <int> 21, 113, 39, 80, 87, 29, 16, 35, 54, 70, 84, 85, 25,...
## $ ttr <int> 182, 14, 5, 16, 0, 182, 14, 77, 2, 0, 12, 182, 21, 3...
## $ relapse <int> 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0...
## $ grp <fct> patchOnly, patchOnly, combination, combination, comb...
## $ age <int> 36, 41, 25, 54, 45, 43, 66, 78, 40, 38, 64, 51, 37, ...
## $ gender <fct> Male, Male, Female, Male, Male, Male, Male, Female, ...
## $ race <fct> white, white, white, white, white, hispanic, black, ...
## $ employment <fct> ft, other, other, ft, other, ft, pt, other, ft, ft, ...
## $ yearsSmoking <int> 26, 27, 12, 39, 30, 30, 54, 56, 25, 23, 30, 35, 23, ...
## $ levelSmoking <fct> heavy, heavy, heavy, heavy, heavy, heavy, heavy, lig...
## $ ageGroup2 <fct> 21-49, 21-49, 21-49, 50+, 21-49, 21-49, 50+, 50+, 21...
## $ ageGroup4 <fct> 35-49, 35-49, 21-34, 50-64, 35-49, 35-49, 65+, 65+, ...
## $ priorAttempts <int> 0, 3, 3, 0, 0, 2, 0, 10, 4, 10, 12, 1, 5, 6, 5, 2, 1...
## $ longestNoSmoke <int> 0, 90, 21, 0, 0, 1825, 0, 15, 7, 90, 365, 7, 1095, 1...
pharmaco_smoker_tbl %>% head()
## # A tibble: 6 x 14
## id ttr relapse grp age gender race employment yearsSmoking
## <int> <int> <int> <fct> <int> <fct> <fct> <fct> <int>
## 1 21 182 0 patc… 36 Male white ft 26
## 2 113 14 1 patc… 41 Male white other 27
## 3 39 5 1 comb… 25 Female white other 12
## 4 80 16 1 comb… 54 Male white ft 39
## 5 87 0 1 comb… 45 Male white other 30
## 6 29 182 0 comb… 43 Male hisp… ft 30
## # ... with 5 more variables: levelSmoking <fct>, ageGroup2 <fct>,
## # ageGroup4 <fct>, priorAttempts <int>, longestNoSmoke <int>
In the problems we work on, our data observation process does not allow us to fully observe the variable in question. Instead, our observations are often right-censored - that is, we know the value in question is at least the observed value, but it may also be higher.
Expressing this formally, suppose \(T^*\) is the time to event and \(U\) is the time to censoring; then our observed time \(T\) and censoring indicator \(\delta\) are given by
\[\begin{eqnarray*} T &=& \min(T^*, U) \\ \delta &=& I[T^* \leq U] \end{eqnarray*}\]
A less common phenomenon is left-censoring - where we observe the event to be at most a given duration. This may happen in medical studies where continual observation of the patients is not possible or feasible.
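The censoring mechanism is easy to mimic in code. This is a small sketch with made-up event and censoring times, just to show how \(T\) and \(\delta\) arise:

```r
# Hypothetical event times T* and censoring times U for three subjects;
# we observe only T = min(T*, U) and the indicator delta = I[T* <= U].
T_star <- c(5, 12, 3)
U      <- c(10, 8, 7)

T_obs <- pmin(T_star, U)
delta <- as.integer(T_star <= U)

T_obs   # 5 8 3
delta   # 1 0 1  (the second subject is right-censored)
```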
Our key goal is to find the survival distribution - the distribution of times from some given start point to the time of the event.
Two common ways of specifying this distribution are the survival function, \(S(t)\) and the hazard function, \(h(t)\).
\(S(t)\) is the probability of surviving to time \(t\), so is defined as follows:
\[ S(t) = P(T > t), \, 0 \leq t \leq \infty \]
Thus, \(S(0) = 1\), and \(S(t)\) decreases monotonically with time \(t\). As it is a probability, it is non-negative.
We can also characterise the distribution via the hazard function - the instantaneous failure rate, that is, the rate of failure at time \(t\) given survival up to that time. More formally,
\[ \lambda(t) = \lim_{\delta \to 0} \frac{P(t \leq T < t + \delta \mid T > t)}{\delta} \]
This is also called the intensity function or the force of mortality.
Some analyses are easier to work with in terms of the cumulative hazard function
\[ \Lambda(t) = \int^t_0 \lambda(u) \, du \]
As a consequence of this definition we also have
\[ S(t) = \exp(- \Lambda(t)) \]
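We can check this identity numerically for the simplest case. For a constant hazard \(\lambda\), the cumulative hazard is \(\Lambda(t) = \lambda t\), so \(S(t) = \exp(-\lambda t)\) - the survival function of the exponential distribution:

```r
# For a constant hazard lambda, Lambda(t) = lambda * t, so
# S(t) = exp(-lambda * t) - exactly the exponential survival function.
lambda <- 0.01
t      <- c(10, 50, 100)

S_from_hazard <- exp(-lambda * t)
S_exponential <- pexp(t, rate = lambda, lower.tail = FALSE)

all.equal(S_from_hazard, S_exponential)   # TRUE
```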
Finally, two common statistics for survival are the mean survival time and the median survival time:
The mean survival time is the expected value of the distribution, using standard probability definitions:
\[ \mu = E(T) = \int^\infty_0 t f(t) \, dt = \int^\infty_0 S(t) \, dt \]
The median survival time is the value \(t_{\text{med}}\) such that
\[ S(t_{\text{med}}) = 0.5 \]
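As a quick sanity check of these two definitions, for an exponential distribution with rate \(\lambda\) the mean survival time is \(1/\lambda\) and the median is \(\log(2)/\lambda\):

```r
# Mean survival time via the integral of S(t), median via S(t_med) = 0.5,
# for an exponential distribution with rate lambda.
lambda <- 0.02

mu    <- integrate(function(t) pexp(t, lambda, lower.tail = FALSE), 0, Inf)$value
t_med <- qexp(0.5, rate = lambda)

c(mu, 1 / lambda)           # both 50
c(t_med, log(2) / lambda)   # both ~34.66
```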
Let’s look at some different survival distributions.
t_seq <- 1:100
h_const_seq <- rep_along(t_seq, 0.01)
S_const_seq <- cumprod(c(1, (1 - h_const_seq)))
hazard_plot <- ggplot() +
geom_line(aes(x = t_seq, y = h_const_seq)) +
expand_limits(y = 0) +
xlab("Time") +
ylab("Instantaneous Hazard")
cuml_plot <- ggplot() +
geom_line(aes(x = c(0, t_seq), y = S_const_seq)) +
expand_limits(y = 0) +
xlab("Time") +
ylab("Cumulative Survival")
plot_grid(hazard_plot, cuml_plot, nrow = 2)
If we increase the constant hazard, we get a similar shape, but the curve declines faster.
h_const_high_seq <- rep_along(t_seq, 0.03)
S_const_high_seq <- cumprod(c(1, (1 - h_const_high_seq)))
hazard_plot <- ggplot() +
geom_line(aes(x = t_seq, y = h_const_high_seq)) +
geom_line(aes(x = t_seq, y = h_const_seq)
,colour = 'blue', linetype = 'dashed') +
expand_limits(y = 0) +
xlab("Time") +
ylab("Instantaneous Hazard")
cuml_plot <- ggplot() +
geom_line(aes(x = c(0, t_seq), y = S_const_high_seq)) +
geom_line(aes(x = c(0, t_seq), y = S_const_seq)
,colour = 'blue', linetype = 'dashed') +
expand_limits(y = 0) +
xlab("Time") +
ylab("Cumulative Survival")
plot_grid(hazard_plot, cuml_plot, nrow = 2)
A decreasing hazard - high risk early on that settles to a constant rate - looks like this:
early_seq <- seq(0.10, 0.01, by = -0.01)
late_seq <- rep(0.01, 100 - length(early_seq))
h_early_seq <- c(early_seq, late_seq)
S_early_seq <- cumprod(c(1, (1 - h_early_seq)))
hazard_plot <- ggplot() +
geom_line(aes(x = t_seq, y = h_early_seq)) +
expand_limits(y = 0) +
xlab("Time") +
ylab("Instantaneous Hazard")
cuml_plot <- ggplot() +
geom_line(aes(x = c(0, t_seq), y = S_early_seq)) +
expand_limits(y = 0) +
xlab("Time") +
ylab("Cumulative Survival")
plot_grid(hazard_plot, cuml_plot, nrow = 2)
Finally, an increasing hazard, where the risk rises over time:
late_seq <- seq(0.01, 0.10, by = 0.0015)
early_seq <- rep(0.01, 100 - length(late_seq))
h_late_seq <- c(early_seq, late_seq)
S_late_seq <- cumprod(c(1, (1 - h_late_seq)))
hazard_plot <- ggplot() +
geom_line(aes(x = t_seq, y = h_late_seq)) +
expand_limits(y = 0) +
xlab("Time") +
ylab("Instantaneous Hazard")
cuml_plot <- ggplot() +
geom_line(aes(x = c(0, t_seq), y = S_late_seq)) +
expand_limits(y = 0) +
xlab("Time") +
ylab("Cumulative Survival")
plot_grid(hazard_plot, cuml_plot, nrow = 2)
Kaplan-Meier is the standard method for estimating the survival function of a given dataset. Formally, it is defined as follows
\[ \hat{S}(t) = \prod_{t_i \leq t} (1 - \hat{q}_i) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right) \]
where \(n_i\) is the number of subjects at risk at time \(t_i\), and \(d_i\) is the number of individuals who fail at that time.
In R, we construct KM estimators using the survfit() function.
Before we move on to our datasets, we start with a small set of data.
tt <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)
grp <- c(0, 0, 0, 1, 1, 1)
Surv(tt, cens)
## [1] 7+ 6 6+ 5+ 2 4
sample_tbl <- tibble(tt = tt, cens = cens, grp = grp)
example_km <- survfit(Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = 'log-log')
plot(example_km)
The base plotting routines are worth trying, but the survminer package provides specialised survival plots built on ggplot2.
ggsurvplot(example_km)
Printing out the ‘fitted’ object gives us some basic statistics:
example_km %>% print()
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
##
## n events median 0.95LCL 0.95UCL
## 6 3 6 2 NA
We get more details from the summary() function:
example_km %>% summary()
## Call: survfit(formula = Surv(tt, cens) ~ 1, data = sample_tbl, conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 2 6 1 0.833 0.152 0.2731 0.975
## 4 5 1 0.667 0.192 0.1946 0.904
## 6 3 1 0.444 0.222 0.0662 0.785
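We can reproduce these numbers by hand directly from the Kaplan-Meier formula: in the toy data the events occur at times 2, 4 and 6, with risk sets of size 6, 5 and 3 respectively.

```r
# Kaplan-Meier by hand for the toy data: cumulative product of (1 - d_i / n_i)
n_risk  <- c(6, 5, 3)   # subjects at risk at event times 2, 4, 6
n_event <- c(1, 1, 1)   # failures at each of those times

S_hat <- cumprod(1 - n_event / n_risk)
round(S_hat, 3)   # 0.833 0.667 0.444 - matching the summary() output
```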
A useful quantity is how long the observation period lasts - something that can be subtly difficult to estimate.
One method is to switch the censoring labels - that is, we treat the original event as a censoring of the observation time in the study.
sample_tbl <- sample_tbl %>%
mutate(follow = 1 - cens)
follow_km <- survfit(Surv(tt, follow) ~ 1, data = sample_tbl, conf.type = 'log-log')
ggsurvplot(follow_km)
follow_km %>% summary()
## Call: survfit(formula = Surv(tt, follow) ~ 1, data = sample_tbl, conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 4 1 0.75 0.217 0.1279 0.961
## 6 3 1 0.50 0.250 0.0578 0.845
## 7 1 1 0.00 NaN NA NA
An empirical estimate of the hazard function at event time \(t_i\) is given by
\[ \hat{\lambda}(t_i) = \frac{d_i}{n_i} \]
This estimate is very noisy, as it is sensitive to sampling variation at each event time.
To obtain smoother estimates, we use kernel smoothing techniques.
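For the toy dataset the raw estimate is easy to compute directly, using the same risk sets and event counts as the Kaplan-Meier calculation:

```r
# Raw hazard estimates d_i / n_i at the event times 2, 4 and 6 of the toy data
n_risk  <- c(6, 5, 3)
n_event <- c(1, 1, 1)

h_hat <- n_event / n_risk
round(h_hat, 3)   # 0.167 0.200 0.333
```

The jump from 0.2 to 0.33 across a single event shows how jagged this estimate is, and why smoothing is attractive.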
sample_muhaz <- muhaz(sample_tbl$tt, sample_tbl$cens, max.time = 7)
plot(sample_muhaz)
The broom package has tidying methods for muhaz objects, which lets us create plots with ggplot2.
muhaz_tidy_tbl <- sample_muhaz %>% tidy()
muhaz_tidy_tbl %>% glimpse()
## Observations: 101
## Variables: 2
## $ time <dbl> 0.00, 0.07, 0.14, 0.21, 0.28, 0.35, 0.42, 0.49, 0.56, 0.63...
## $ estimate <dbl> 0.000000000, 0.000000000, 0.000000000, 0.000871904, 0.0034...
We have estimates of the hazard function now and so can plot it.
ggplot(muhaz_tidy_tbl) +
geom_line(aes(x = time, y = estimate)) +
expand_limits(x = 0, y = 0) +
xlab("Time") +
ylab("Estimated Hazard")
By default, muhaz() corrects for the boundary on both sides, but we may not wish this. To get estimates without this correction, we add it as an argument to the function call.
sample_nocorr_muhaz <- muhaz(sample_tbl$tt, sample_tbl$cens, max.time = 7
,b.cor = 'none')
plot(sample_nocorr_muhaz)
sample_nocorr_muhaz_tidy_tbl <- sample_nocorr_muhaz %>% tidy()
ggplot(sample_nocorr_muhaz_tidy_tbl) +
geom_line(aes(x = time, y = estimate)) +
expand_limits(x = 0, y = 0) +
xlab("Time") +
ylab("Estimated Hazard")
To help ensure these smoothed estimates capture the correct aspects of the data, we also have the equivalent pehaz() functions, which give piecewise (histogram-style) estimates of the hazard - much as histograms and kernel density estimates are discrete and continuous analogues of one another.
Now that we have both methods of checking the empirical estimates, it is good to compare them.
Before we do this, we note that Kaplan-Meier estimates the survival function but muhaz() gives us estimates of the hazard function.
Thus we need to do some conversions; the easiest way is to numerically integrate the hazard function to obtain the survival function, which we then compare to the KM estimate.
muhaz_surv_tbl <- muhaz_tidy_tbl %>%
mutate(dt = c(0, diff(time))
,S_t = exp(-cumsum(estimate * dt))
)
ggsurvplot(example_km)$plot +
geom_line(aes(x = time, y = S_t), data = muhaz_surv_tbl)
We also want to compare the results without the boundary correction.
muhaz_nocorr_surv_tbl <- sample_nocorr_muhaz_tidy_tbl %>%
mutate(dt = c(0, diff(time))
,S_t = exp(-cumsum(estimate * dt))
)
ggsurvplot(example_km)$plot +
geom_line(aes(x = time, y = S_t), colour = 'blue', data = muhaz_surv_tbl) +
geom_line(aes(x = time, y = S_t), data = muhaz_nocorr_surv_tbl)
We see that removing the boundary corrections can cause big discrepancies between the smoothed and discrete estimates (at least in this case).
We are now going to build Kaplan-Meier estimates for all the data, as well as for splits of the data along vmailplan - whether or not the account had a voicemail plan - and intlplan - whether or not the account had an international calls plan.
telco_all_km <- survfit(Surv(accountdur, churned) ~ 1, data = telco_churn_tbl)
telco_vmail_km <- survfit(Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
telco_intl_km <- survfit(Surv(accountdur, churned) ~ intlplan, data = telco_churn_tbl)
Having created the KM estimates, we first plot the curves to check for differences.
telco_all_km %>% ggsurvplot(censor = FALSE)
telco_vmail_km %>% ggsurvplot(censor = FALSE)
telco_intl_km %>% ggsurvplot(censor = FALSE)
The estimators come with upper and lower confidence bounds, and it may be useful to include these when assessing the differences. We will look at a more formal way of discriminating between curves shortly.
telco_all_km %>% ggsurvplot(censor = FALSE, conf.int = TRUE)
telco_vmail_km %>% ggsurvplot(censor = FALSE, conf.int = TRUE)
telco_intl_km %>% ggsurvplot(censor = FALSE, conf.int = TRUE)
This is useful, but it would be better to combine all of these in a single plot. To do this, we extract the tables of estimator data and stack them together.
Fortunately, the broom package is compatible with basic survival analysis routines, so this provides us with some utilities to help with this.
If we need to access the KM data directly, we can use tidy() to get a look:
telco_all_km_data <- telco_all_km %>% tidy()
telco_all_km_data %>% glimpse()
## Observations: 218
## Variables: 8
## $ time <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ n.risk <dbl> 5000, 4989, 4987, 4979, 4976, 4974, 4972, 4967, 4965, 496...
## $ n.event <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 2, 0, 1, ...
## $ n.censor <dbl> 10, 1, 7, 2, 2, 2, 5, 2, 3, 3, 6, 4, 9, 1, 5, 9, 6, 4, 8,...
## $ estimate <dbl> 0.999800, 0.999600, 0.999399, 0.999198, 0.999198, 0.99919...
## $ std.error <dbl> 0.000200020, 0.000283183, 0.000347001, 0.000400944, 0.000...
## $ conf.high <dbl> 1.000000, 1.000000, 1.000000, 0.999984, 0.999984, 0.99998...
## $ conf.low <dbl> 0.999408, 0.999045, 0.998720, 0.998414, 0.998414, 0.99841...
telco_vmail_km_data <- telco_vmail_km %>% tidy()
telco_vmail_km_data %>% glimpse()
## Observations: 404
## Variables: 9
## $ time <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ n.risk <dbl> 3677, 3670, 3669, 3664, 3662, 3660, 3658, 3655, 3654, 365...
## $ n.event <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, ...
## $ n.censor <dbl> 6, 0, 4, 1, 2, 2, 3, 1, 3, 2, 4, 6, 1, 5, 9, 4, 4, 5, 3, ...
## $ estimate <dbl> 0.999728, 0.999456, 0.999183, 0.998911, 0.998911, 0.99891...
## $ std.error <dbl> 0.000271998, 0.000385030, 0.000471756, 0.000545035, 0.000...
## $ conf.high <dbl> 1.000000, 1.000000, 1.000000, 0.999978, 0.999978, 0.99997...
## $ conf.low <dbl> 0.999195, 0.998702, 0.998260, 0.997844, 0.997844, 0.99784...
## $ strata <chr> "vmailplan=no", "vmailplan=no", "vmailplan=no", "vmailpla...
telco_intl_km_data <- telco_intl_km %>% tidy()
telco_intl_km_data %>% glimpse()
## Observations: 375
## Variables: 9
## $ time <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ n.risk <dbl> 4527, 4516, 4515, 4508, 4507, 4505, 4503, 4498, 4496, 449...
## $ n.event <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, ...
## $ n.censor <dbl> 10, 1, 6, 1, 2, 2, 5, 2, 3, 3, 6, 3, 9, 1, 3, 8, 6, 4, 7,...
## $ estimate <dbl> 0.999779, 0.999779, 0.999558, 0.999558, 0.999558, 0.99955...
## $ std.error <dbl> 0.000220921, 0.000220921, 0.000312845, 0.000312845, 0.000...
## $ conf.high <dbl> 1.000000, 1.000000, 1.000000, 1.000000, 1.000000, 1.00000...
## $ conf.low <dbl> 0.999346, 0.999346, 0.998945, 0.998945, 0.998945, 0.99894...
## $ strata <chr> "intlplan=no", "intlplan=no", "intlplan=no", "intlplan=no...
We stack the data and create a line plot of each of the survival curves.
stacked_km_tbl <- list(telco_all_km_data %>% mutate(strata = 'ALL')
,telco_vmail_km_data
,telco_intl_km_data
) %>%
bind_rows()
ggplot(stacked_km_tbl) +
geom_line(aes(x = time, y = estimate, colour = strata)) +
geom_line(aes(x = time, y = conf.high, colour = strata), linetype = 'dashed') +
geom_line(aes(x = time, y = conf.low, colour = strata), linetype = 'dashed')
This chart is not as useful as it could be - it is too ‘busy’. We need to simplify.
Instead of including everything, we will look at the international plan curves along with the estimator calculated from the full dataset.
ggplot(stacked_km_tbl %>% filter(grepl('ALL|intl', strata))) +
geom_line(aes(x = time, y = estimate, colour = strata)) +
geom_line(aes(x = time, y = conf.high, colour = strata), linetype = 'dashed') +
geom_line(aes(x = time, y = conf.low, colour = strata), linetype = 'dashed')
From this, it looks like people without international plans are similar to the population as a whole, but those with plans have a much worse churn effect. This type of information will be useful later.
We can do a similar thing for the voice-mail plans.
ggplot(stacked_km_tbl %>% filter(grepl('ALL|vmail', strata))) +
geom_line(aes(x = time, y = estimate, colour = strata)) +
geom_line(aes(x = time, y = conf.high, colour = strata), linetype = 'dashed') +
geom_line(aes(x = time, y = conf.low, colour = strata), linetype = 'dashed')
This split is more mixed: those with a voicemail plan appear to have a better churn experience than the population, while those without are similar to the population as a whole.
Now that we have ways of constructing different survival curves from our data, we turn our attention to checking whether the differences between them can be explained by random chance. The log-rank test approaches this by building a table of failure counts at each distinct event time \(t_i\):
| | Control | Treatment | Total |
|---|---|---|---|
| Failure | \(d_{0i}\) | \(d_{1i}\) | \(d_i\) |
| Non-failure | \(n_{0i} - d_{0i}\) | \(n_{1i} - d_{1i}\) | \(n_i - d_i\) |
| Total | \(n_{0i}\) | \(n_{1i}\) | \(n_i\) |
From this table of values, we define two values:
\[\begin{eqnarray*} e_{0i} &=& E(d_{0i}) = \frac{n_{0i} d_i}{n_i} \\ v_{0i} &=& \text{var}(d_{0i}) = \frac{n_{0i} n_{1i} d_i (n_i - d_i)}{n_i^2(n_i - 1)} \\ U_0 &=& \sum_{i=1}^{N} (d_{0i} - e_{0i}) = \sum d_{0i} - \sum e_{0i} \\ V_0 &=& \text{var}(U_0) = \sum v_{0i} \end{eqnarray*}\]
Having defined these variables as above, we can show that
\[ \frac{U_0^2}{V_0} \sim \chi_1^2 \]
This gives us a natural statistical test to compare the two curves.
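To see the mechanics, we can compute the statistic by hand for the toy two-group dataset from earlier. This sketch just walks through the \(e_{0i}\), \(v_{0i}\), \(U_0\) and \(V_0\) definitions above; survdiff() performs the same calculation internally.

```r
# Hand-rolled log-rank statistic for a toy two-group dataset
tt   <- c(7, 6, 6, 5, 2, 4)
cens <- c(0, 1, 0, 0, 1, 1)
grp  <- c(0, 0, 0, 1, 1, 1)

event_times <- sort(unique(tt[cens == 1]))

terms <- sapply(event_times, function(t) {
  at_risk <- tt >= t
  n  <- sum(at_risk)                           # n_i
  n0 <- sum(at_risk & grp == 0)                # n_0i
  d  <- sum(tt == t & cens == 1)               # d_i
  d0 <- sum(tt == t & cens == 1 & grp == 0)    # d_0i
  e0 <- n0 * d / n
  v0 <- n0 * (n - n0) * d * (n - d) / (n^2 * (n - 1))
  c(d0 = d0, e0 = e0, v0 = v0)
})

U0 <- sum(terms["d0", ] - terms["e0", ])
V0 <- sum(terms["v0", ])

U0^2 / V0   # chi-squared statistic on 1 degree of freedom, ~2.47
```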
The simplest case is when we are comparing survival data split down binary outcomes, such as whether or not the customer has a voicemail plan or a plan with international calls.
We start by comparing the voicemail plans.
telco_vmail_survdiff <- survdiff(Surv(accountdur, churned) ~ vmailplan
,data = telco_churn_tbl)
telco_vmail_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## vmailplan=no 3677 605 523 13.0 50.1
## vmailplan=yes 1323 102 184 36.8 50.1
##
## Chisq= 50.1 on 1 degrees of freedom, p= 1e-12
telco_vmail_survdiff %>% glance()
## # A tibble: 1 x 3
## statistic df p.value
## <dbl> <dbl> <dbl>
## 1 50.1 1 1.48e-12
Such a small \(p\)-value suggests the differences between the curves are unlikely to be due to chance, and this agrees with what we observed in the plots of the curves.
Now we compare the differences between plans with and without international calls.
telco_intl_survdiff <- survdiff(Surv(accountdur, churned) ~ intlplan
,data = telco_churn_tbl)
telco_intl_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ intlplan, data = telco_churn_tbl)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## intlplan=no 4527 508 638.6 26.7 277
## intlplan=yes 473 199 68.4 249.3 277
##
## Chisq= 277 on 1 degrees of freedom, p= <2e-16
telco_intl_survdiff %>% glance()
## # A tibble: 1 x 3
## statistic df p.value
## <dbl> <dbl> <dbl>
## 1 277. 1 0
The approach discussed can be extended to categorical variables with multiple values. We can use areacode for this as it has three values in it:
telco_churn_tbl %>% count(areacode)
## # A tibble: 3 x 2
## areacode n
## <chr> <int>
## 1 408 1259
## 2 415 2495
## 3 510 1246
Let’s check how it splits the survival curves.
telco_areacode_km <- survfit(Surv(accountdur, churned) ~ areacode
,data = telco_churn_tbl)
telco_areacode_km %>% ggsurvplot(censor = TRUE, conf.int = TRUE)
Now let’s run the log-rank test on this data:
telco_areacode_survdiff <- survdiff(Surv(accountdur, churned) ~ areacode
,data = telco_churn_tbl)
telco_areacode_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ areacode, data = telco_churn_tbl)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## areacode=408 1259 177 185 0.3357 0.4562
## areacode=415 2495 346 349 0.0236 0.0468
## areacode=510 1246 184 173 0.6667 0.8866
##
## Chisq= 1 on 2 degrees of freedom, p= 0.6
telco_areacode_survdiff %>% glance()
## # A tibble: 1 x 3
## statistic df p.value
## <dbl> <dbl> <dbl>
## 1 1.03 2 0.598
Again, the results of the log-rank test mesh with our visual inspection: a \(p\)-value of 0.6 suggests the differences are likely due to chance, and across most of the curves the three area codes pretty much share the same experience.
Using our domain knowledge, this also makes sense - we would not expect phones based in different area codes to have substantial differences: area codes are too big and broad to show systemic geographic effects.
In the event that we have numeric variables affecting survival, we need to perform some discretization of those values to allow us to assess these effects.
Regression models discussed later will allow us to tackle these dependencies in a more direct way, but for now we will use this simplified approach.
One numeric variable likely to have an effect on churn is custservicecalls - the number of calls made to customer service. We would expect that an account with more service calls is likely to be dissatisfied, so higher call counts correspond to a higher churn rate.
Having decided to bin custservicecalls, we now need to decide how to go about this. One way is to split the data by quantiles, so we try this first.
Should this not prove satisfactory, it may help inform more suitable ways to bin this value.
telco_churn_tbl %>% count(custservicecalls) %>% mutate(cuml = cumsum(n))
## # A tibble: 10 x 3
## custservicecalls n cuml
## <int> <int> <int>
## 1 0 1023 1023
## 2 1 1786 2809
## 3 2 1127 3936
## 4 3 665 4601
## 5 4 252 4853
## 6 5 96 4949
## 7 6 34 4983
## 8 7 13 4996
## 9 8 2 4998
## 10 9 2 5000
We see the distribution is heavily skewed towards low call counts, so ntile() is not effective here - it will not separate the counts properly. Instead we use cut(), which gives us more control over the binning.
telco_churn_cat_tbl <- telco_churn_tbl %>%
mutate(calls_cat = cut(custservicecalls
,breaks = c(0, 1, 2, 5, Inf)
,right = FALSE)
)
telco_callscat_km <- survfit(Surv(accountdur, churned) ~ calls_cat
,data = telco_churn_cat_tbl)
telco_callscat_km %>% ggsurvplot(censor = FALSE, conf.int = TRUE)
It looks like there is not much difference for lower counts, but there is a worse churn experience when the number of customer service calls is 5 or higher.
telco_callscat_survdiff <- survdiff(Surv(accountdur, churned) ~ calls_cat
,data = telco_churn_cat_tbl)
telco_callscat_survdiff %>% print()
## Call:
## survdiff(formula = Surv(accountdur, churned) ~ calls_cat, data = telco_churn_cat_tbl)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## calls_cat=[0,1) 1023 121 145.1 3.99 5.04
## calls_cat=[1,2) 1786 190 254.7 16.45 25.82
## calls_cat=[2,5) 2044 306 285.4 1.49 2.51
## calls_cat=[5,Inf) 147 90 21.8 213.33 220.88
##
## Chisq= 236 on 3 degrees of freedom, p= <2e-16
telco_callscat_survdiff %>% glance()
## # A tibble: 1 x 3
## statistic df p.value
## <dbl> <dbl> <dbl>
## 1 236. 3 0
Up to this point we have focused on empirical and non-parametric approaches to estimating survival and hazard functions.
While useful, these approaches do not work when we want to consider multiple variables simultaneously. We could try to construct different KM estimates according to the splits, but splitting along multiple directions rapidly diminishes the amount of data in each split.
Instead, we start to build regression models to estimate the hazard and survival functions.
The core idea of the proportional hazards model is that we model the hazard function directly and construct the survival curve from it.
We assume we have a reference hazard function of time, \(h_0(t)\), known as the baseline hazard. All hazard functions are proportional to this baseline:
\[ h(t) = \psi \, h_0(t) \]
Implicit in the above statement is that the time-dependence of the hazard is built into the baseline hazard \(h_0(t)\) - the coefficients in the Cox PH model have no time-varying effects.
The hazards can be greater or less than this baseline: \(\psi > 1\) means the hazard is greater than the baseline and the survival curve lies below the baseline curve; \(\psi < 1\) means the hazard is less and the survival curve lies above it.
We model this constant factor \(\psi\) as a linear model - we have covariates \(x_i\) as inputs to the model, and each covariate has a corresponding coefficient \(\beta_i\).
\[ \psi = \exp(\beta \, x) = \exp(\beta_1 x_1 + ... + \beta_n x_n) \]
Thus, stated fully, our Cox PH model can be expressed as follows:
\[ h(t | X) = h_0(t) \exp(\beta \, X) \]
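A quick numerical illustration of how a coefficient translates into a hazard ratio (the coefficient value here is made up for the example, not taken from any fitted model):

```r
# For a single binary covariate x with an assumed coefficient beta = -0.7,
# subjects with x = 1 have roughly half the baseline hazard at every time t.
beta <- -0.7
psi  <- exp(beta * 1)   # multiplicative hazard ratio relative to x = 0

round(psi, 3)   # ~0.497: hazard halved, so the survival curve sits above baseline
```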
As our data is censored, we need to modify the standard maximum likelihood approach to estimating the parameters of a model - instead we use the ‘partial likelihood’, which takes the censoring into account.
We will not go into the details, but this method reduces to optimising a function over the parameter space. A side benefit is that once we have the parameter values, we can also use them to calculate the baseline hazard.
With this in mind, we now turn our attention to actually fitting the models.
For our first Cox-PH model, we build a model using the voicemail parameter.
telco_model01_coxph <- coxph(Surv(accountdur, churned) ~ vmailplan
,data = telco_churn_tbl
)
telco_model01_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
##
## n= 5000, number of events= 707
##
## coef exp(coef) se(coef) z Pr(>|z|)
## vmailplanyes -0.741 0.477 0.107 -6.92 4.7e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## vmailplanyes 0.477 2.1 0.387 0.588
##
## Concordance= 0.565 (se = 0.01 )
## Rsquare= 0.011 (max possible= 0.88 )
## Likelihood ratio test= 56.6 on 1 df, p=5e-14
## Wald test = 47.8 on 1 df, p=5e-12
## Score (logrank) test = 50 on 1 df, p=2e-12
Now that we have built our first survival regression model, how do we assess how good the model is?
Model assessment in survival analysis is less straightforward, as the models produce a distribution rather than a single number. We will discuss more robust assessments later, but for now it is worth introducing a number of quantitative measures of model quality.
The standard measure of model fit is \(R^2\), the proportion of variance explained by the model. This does not quite work for survival models, but we can use a modified \(R^2\) statistic, defined for Cox-PH models as follows:
\[ R^2 = 1 - \left( \frac{l(0)}{l(\hat{\beta})} \right)^{\frac{2}{n}} \]
where \(l(\beta)\) is the log-likelihood function.
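This formula is easy to evaluate from log-likelihoods. Since the likelihood ratio statistic is \(2(l(\hat{\beta}) - l(0))\), the LRT value of 56.6 on \(n = 5000\) printed in the model summary above recovers the reported Rsquare of 0.011:

```r
# Pseudo-R^2 for a Cox model from its log-likelihoods:
# R^2 = 1 - exp((2/n) * (loglik0 - loglik1))
pseudo_r2 <- function(loglik0, loglik1, n) {
  1 - exp((2 / n) * (loglik0 - loglik1))
}

# Equivalently, using the likelihood ratio statistic LRT = 2 * (l1 - l0):
r2 <- 1 - exp(-56.6 / 5000)
round(r2, 3)   # 0.011, matching the summary output
```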
The concordance index - also known as the \(c\)-index or \(c\)-statistic - is the proportion of pairwise comparisons whose ordering the model correctly identifies. For a pair to be comparable, the subject with the shorter observed time must be uncensored.
For each comparable pair, we check whether the model correctly predicts the order of the two event times - a correct prediction adds 1 to the total, an incorrect one adds 0. We then divide the total by the number of comparisons made to get our statistic.
Thus, a concordance index of 0.5 is equivalent to random guessing and is our baseline value. A good survival model will have a \(c\)-index between 0.6 and 0.7 - anything higher is rarely achievable given the censored nature of our data.
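To make the pairwise definition concrete, here is a naive sketch of the calculation. The concordance_index function is illustrative only - it is not part of any package, and real implementations handle ties and censoring more carefully.

```r
# Naive O(n^2) concordance: for each pair where subject i is observed to fail
# before subject j's observed time, the pair is concordant when the model
# assigns i the higher risk score; tied risk scores count as half-concordant.
concordance_index <- function(time, event, risk) {
  num <- 0
  den <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (event[i] == 1 && time[i] < time[j]) {
        den <- den + 1
        num <- num + if (risk[i] > risk[j]) 1
               else if (risk[i] == risk[j]) 0.5
               else 0
      }
    }
  }
  num / den
}

# A model ranking all subjects perfectly scores 1; reversing the risks scores 0.
concordance_index(c(1, 2, 3), c(1, 1, 1), c(3, 2, 1))   # 1
concordance_index(c(1, 2, 3), c(1, 1, 1), c(1, 2, 3))   # 0
```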
So how good is our model?
telco_model01_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ vmailplan, data = telco_churn_tbl)
##
## n= 5000, number of events= 707
##
## coef exp(coef) se(coef) z Pr(>|z|)
## vmailplanyes -0.741 0.477 0.107 -6.92 4.7e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## vmailplanyes 0.477 2.1 0.387 0.588
##
## Concordance= 0.565 (se = 0.01 )
## Rsquare= 0.011 (max possible= 0.88 )
## Likelihood ratio test= 56.6 on 1 df, p=5e-14
## Wald test = 47.8 on 1 df, p=5e-12
## Score (logrank) test = 50 on 1 df, p=2e-12
telco_model01_coxph %>%
glance() %>%
select(n, nevent, r.squared, concordance, std.error.concordance)
## # A tibble: 1 x 5
## n nevent r.squared concordance std.error.concordance
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 5000 707 0.0113 0.565 0.00960
telco_model01_coxph %>% tidy()
## # A tibble: 1 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 vmailplanyes -0.741 0.107 -6.92 4.67e-12 -0.951 -0.531
This simple model with one binary variable gives us a \(c\)-index of 0.565. The \(R^2\) is essentially zero (0.011), and we will check how the two measures track each other as we iterate on the model-building.
For our second model we use the international plan indicator, again on its own, before we see what including both variables does.
telco_model02_coxph <- coxph(Surv(accountdur, churned) ~ intlplan
,data = telco_churn_tbl
)
telco_model02_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ intlplan, data = telco_churn_tbl)
##
## n= 5000, number of events= 707
##
## coef exp(coef) se(coef) z Pr(>|z|)
## intlplanyes 1.3023 3.6778 0.0838 15.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## intlplanyes 3.68 0.272 3.12 4.33
##
## Concordance= 0.591 (se = 0.006 )
## Rsquare= 0.038 (max possible= 0.88 )
## Likelihood ratio test= 194 on 1 df, p=<2e-16
## Wald test = 242 on 1 df, p=<2e-16
## Score (logrank) test = 278 on 1 df, p=<2e-16
Model02 has a \(c\)-index of 0.591, so the international plan has more predictive power. Note that \(R^2\) is increasing with \(c\)-index, so this relationship is worth investigating later.
telco_model02_coxph %>%
glance() %>%
select(n, nevent, r.squared, concordance, std.error.concordance)
## # A tibble: 1 x 5
## n nevent r.squared concordance std.error.concordance
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 5000 707 0.0380 0.591 0.00646
telco_model02_coxph %>% tidy()
## # A tibble: 1 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intlplanyes 1.30 0.0838 15.5 1.62e-54 1.14 1.47
Having checked the variables separately, we now combine them into a single model containing both variables. We expect combining both variables to improve our model, but if both are highly correlated this may not be the case.
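Before fitting, a quick cross-tabulation gives a rough sense of how the two indicators co-occur - a simple sketch of the correlation concern mentioned above:

```r
# Cross-tabulate the two plan indicators: a strong association between
# them would limit the gain from including both in the model.
telco_churn_tbl %>% count(vmailplan, intlplan)
```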
telco_model03_coxph <- coxph(Surv(accountdur, churned) ~ vmailplan + intlplan
,data = telco_churn_tbl
)
telco_model03_coxph %>% summary()
## Call:
## coxph(formula = Surv(accountdur, churned) ~ vmailplan + intlplan,
## data = telco_churn_tbl)
##
## n= 5000, number of events= 707
##
## coef exp(coef) se(coef) z Pr(>|z|)
## vmailplanyes -0.7676 0.4641 0.1071 -7.16 7.8e-13 ***
## intlplanyes 1.3199 3.7429 0.0838 15.75 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## vmailplanyes 0.464 2.155 0.376 0.573
## intlplanyes 3.743 0.267 3.176 4.411
##
## Concordance= 0.643 (se = 0.01 )
## Rsquare= 0.05 (max possible= 0.88 )
## Likelihood ratio test= 255 on 2 df, p=<2e-16
## Wald test = 294 on 2 df, p=<2e-16
## Score (logrank) test = 332 on 2 df, p=<2e-16
The combined model, Model03, has a \(c\)-index of 0.643 and an \(R^2\) of 0.05.
telco_model03_coxph %>%
glance() %>%
select(n, nevent, r.squared, concordance, std.error.concordance)
## # A tibble: 1 x 5
## n nevent r.squared concordance std.error.concordance
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 5000 707 0.0497 0.643 0.0105
telco_model03_coxph %>% tidy()
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 vmailplanyes -0.768 0.107 -7.16 7.80e-13 -0.978 -0.558
## 2 intlplanyes 1.32 0.0838 15.8 6.82e-56 1.16 1.48
This model appears to be better, and there are many more variables we can try to improve the model.
Before putting more time and effort into iterating on this model, we take some time to check the proportional hazards assumption. As we might expect, there are tools to help with this.
We have skipped much of the statistical theory up to this point, but we now need to look at using some of it.
A common statistical approach for assessing regression models is through analysis of the residuals, typically defined as some form of difference between the predicted and observed values.
For survival analysis, our approach is more involved in terms of how we define the residuals of a model, but the basic concept is the same.
We start from a 'null' Cox-PH model - one with no covariates:
telco_model00_coxph <- coxph(Surv(accountdur, churned) ~ 1
,data = telco_churn_tbl
)
telco_model00_coxph %>% summary()
## Call: coxph(formula = Surv(accountdur, churned) ~ 1, data = telco_churn_tbl)
##
## Null model
## log likelihood= -5293.32
## n= 5000
We define our basic residual \(m_i\) as follows:
\[ m_i = \delta_i - \hat{H}_0(t_i) \exp(x_i \hat{\beta}) \]
The residual is the difference between the observed event indicator \(\delta_i\) and its expected value under the Cox model - these are known as the martingale residuals.
telco_model00_resid_m <- telco_model00_coxph %>% residuals(type = 'martingale')
telco_model00_resid_m %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2069 -0.1805 -0.0704 0.0000 -0.0166 0.9998
We have zero mean, but there is skew in the values, as they are bounded above at 1 but unbounded below. The easiest way to check the shape is to look at a histogram of the values.
ggplot() +
geom_histogram(aes(x = telco_model00_resid_m), bins = 50) +
xlab('Residual Value') +
ylab('Count') +
ggtitle('Histogram of Null-Model Martingale Residuals')
ggplot() +
geom_histogram(aes(x = telco_model00_resid_m^2), bins = 50) +
xlab('Squared Residual Value') +
ylab('Count') +
ggtitle('Histogram of Null-Model Squared Martingale Residuals')
Martingale residuals cannot be used in sums-of-squares methods to assess goodness-of-fit - instead we define another quantity that, when the model is correct, is approximately symmetrically distributed around zero. These 'deviance' residuals, \(d_i\), are defined as follows:
\[ d_i = \text{sign}(m_i) \left\{ -2 \left[ m_i + \delta_i \log (\delta_i - m_i) \right] \right\}^{1/2} \]
telco_model00_resid_d <- telco_model00_coxph %>% residuals(type = 'deviance')
telco_model00_resid_d %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.554 -0.601 -0.375 -0.159 -0.182 3.877
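The deviance formula can be verified by hand from the martingale residuals and the event indicator - a sketch, assuming `churned` is the 0/1 event column used in the `Surv()` object:

```r
# Reconstruct the deviance residuals from the martingale residuals.
# For censored observations (delta = 0) the delta * log(delta - m) term
# is taken to be zero, so we guard against 0 * log(0) producing NaN.
delta <- telco_churn_tbl$churned
m     <- telco_model00_resid_m

d_check <- sign(m) *
  sqrt(-2 * (m + ifelse(delta == 0, 0, delta * log(delta - m))))

# should agree with residuals(telco_model00_coxph, type = 'deviance')
all.equal(d_check, telco_model00_resid_d)
```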
The mean is small, but non-zero. We inspect the distribution of values to see what it looks like.
ggplot() +
geom_histogram(aes(x = telco_model00_resid_d), bins = 50) +
xlab('Residual Value') +
ylab('Count') +
ggtitle('Histogram of Null-Model Deviance Residuals')
ggplot() +
geom_histogram(aes(x = telco_model00_resid_d^2), bins = 50) +
xlab('Squared Residual Value') +
ylab('Count') +
ggtitle('Histogram of Null-Model Squared Deviance Residuals')
Neither looks distributed as we would like, suggesting this model is not a good one, but we will return to this topic later.
Much like for linear models, these residuals are useful to help decide which variables may have predictive power in our survival model. We do this by plotting the values of the residual against the variable values.
To do this, we first add the residuals to the original dataset so that we can construct some of the residual plots.
telco_null_assess_tbl <- telco_churn_tbl %>%
mutate(resid_m = telco_model00_resid_m
,resid_d = telco_model00_resid_d
)
Now that we have put the table together we can construct some two-way plots of the variable against the martingale residual:
ggplot(telco_null_assess_tbl) +
geom_boxplot(aes(x = vmailplan, y = resid_m)) +
xlab('vmailplan') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against vmailplan')
ggplot(telco_null_assess_tbl) +
geom_boxplot(aes(x = intlplan, y = resid_m)) +
xlab('intlplan') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against intlplan')
ggplot(telco_null_assess_tbl) +
geom_boxplot(aes(x = areacode, y = resid_m)) +
xlab('areacode') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against areacode')
ggplot(telco_null_assess_tbl) +
geom_boxplot(aes(x = custservicecalls, y = resid_m, group = custservicecalls)) +
xlab('custservicecalls') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against custservicecalls')
custservicecalls is a numeric variable, but it does not take many values so we can create boxplots rather than a more typical scatterplot.
For continuous variables, we look at those scatterplots:
ggplot(telco_null_assess_tbl) +
geom_point(aes(x = vmailmessage, y = resid_m), size = 1, alpha = 0.1) +
xlab('vmailmessage') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against vmailmessage')
ggplot(telco_null_assess_tbl) +
geom_point(aes(x = daymins, y = resid_m), size = 1, alpha = 0.1) +
xlab('daymins') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against daymins')
ggplot(telco_null_assess_tbl) +
geom_point(aes(x = nightmins, y = resid_m), size = 1, alpha = 0.1) +
xlab('nightmins') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against nightmins')
ggplot(telco_null_assess_tbl) +
geom_point(aes(x = intlmins, y = resid_m), size = 1, alpha = 0.1) +
xlab('intlmins') +
ylab('Martingale Residual') +
    ggtitle('Residual Plot of Martingale Residual against intlmins')
No strong patterns stand out here, but that does not necessarily rule these variables out of candidate models. We will not know their predictive power until we try them in a model.
The proportional hazards assumption is key in the construction of the partial likelihood, as it allows the cancellation of the baseline hazard in the calculations.
This assumption of proportional hazards is likely to be violated to some degree in practice, so the real question is how big an effect those violations have on our model. Formal statistical tests are thus of limited value on their own, but they can be a useful indicator of issues with any models we build.
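The survival package provides one such formal check: `cox.zph()` tests the proportional hazards assumption via the scaled Schoenfeld residuals. A minimal sketch applied to our combined model:

```r
library(survival)

# Score tests of proportional hazards, per covariate and globally.
# Small p-values suggest the hazard ratio for that variable drifts
# over time, i.e. a violation of the PH assumption.
telco_model03_zph <- cox.zph(telco_model03_coxph)
telco_model03_zph

# Plot the scaled Schoenfeld residuals against time: a roughly flat
# smooth is consistent with proportional hazards.
plot(telco_model03_zph)
```

Remember that a significant test flags a violation but says nothing about its practical size - the residual plots are usually the more informative diagnostic.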
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.5.1 (2018-07-02)
## os Ubuntu 18.04.1 LTS
## system x86_64, linux-gnu
## ui RStudio
## language en_GB:en
## collate en_GB.UTF-8
## ctype en_GB.UTF-8
## tz Europe/Dublin
## date 2018-10-25
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## asaur * 0.50 2016-04-12 [1] CRAN (R 3.5.0)
## assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
## backports 1.1.2 2017-12-13 [1] CRAN (R 3.5.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.5.0)
## bindr 0.1.1 2018-03-13 [1] CRAN (R 3.5.0)
## bindrcpp * 0.2.2 2018-03-29 [1] CRAN (R 3.5.0)
## broom * 0.5.0 2018-07-17 [1] CRAN (R 3.5.1)
## callr 3.0.0 2018-08-24 [1] CRAN (R 3.5.1)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.5.0)
## cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
## cmprsk 2.2-7 2014-06-17 [1] CRAN (R 3.5.0)
## colorspace 1.3-2 2016-12-14 [1] CRAN (R 3.5.0)
## cowplot * 0.9.3 2018-07-15 [1] CRAN (R 3.5.1)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
## data.table 1.11.8 2018-09-30 [1] CRAN (R 3.5.1)
## debugme 1.1.0 2017-10-22 [1] CRAN (R 3.5.0)
## desc 1.2.0 2018-05-01 [1] CRAN (R 3.5.0)
## devtools * 2.0.0 2018-10-19 [1] CRAN (R 3.5.1)
## digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
## dplyr * 0.7.7 2018-10-16 [1] CRAN (R 3.5.1)
## evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1)
## fansi 0.4.0 2018-10-05 [1] CRAN (R 3.5.1)
## forcats * 0.3.0 2018-02-19 [1] CRAN (R 3.5.0)
## fortunes 1.5-4 2016-12-29 [1] CRAN (R 3.5.0)
## fs 1.2.6 2018-08-23 [1] CRAN (R 3.5.1)
## ggplot2 * 3.0.0 2018-07-03 [1] CRAN (R 3.5.1)
## ggpubr * 0.1.8 2018-08-30 [1] CRAN (R 3.5.1)
## glue 1.3.0 2018-07-17 [1] CRAN (R 3.5.1)
## gridExtra 2.3 2017-09-09 [1] CRAN (R 3.5.0)
## gtable 0.2.0 2016-02-26 [1] CRAN (R 3.5.0)
## haven 1.1.2 2018-06-27 [1] CRAN (R 3.5.0)
## hms 0.4.2 2018-03-10 [1] CRAN (R 3.5.0)
## htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
## httr 1.3.1 2017-08-20 [1] CRAN (R 3.5.0)
## jsonlite 1.5 2017-06-01 [1] CRAN (R 3.5.0)
## km.ci 0.5-2 2009-08-30 [1] CRAN (R 3.5.0)
## KMsurv 0.1-5 2012-12-03 [1] CRAN (R 3.5.0)
## knitr 1.20 2018-02-20 [1] CRAN (R 3.5.0)
## labeling 0.3 2014-08-23 [1] CRAN (R 3.5.0)
## lattice 0.20-35 2017-03-25 [1] CRAN (R 3.5.0)
## lazyeval 0.2.1 2017-10-29 [1] CRAN (R 3.5.0)
## lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0)
## magrittr * 1.5 2014-11-22 [1] CRAN (R 3.5.0)
## Matrix 1.2-14 2018-04-09 [1] CRAN (R 3.5.0)
## memoise 1.1.0 2017-04-21 [1] CRAN (R 3.5.0)
## modelr 0.1.2 2018-05-11 [1] CRAN (R 3.5.0)
## muhaz * 1.2.6 2014-08-09 [1] CRAN (R 3.5.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 3.5.0)
## nlme 3.1-137 2018-04-07 [1] CRAN (R 3.5.0)
## pillar 1.3.0 2018-07-14 [1] CRAN (R 3.5.1)
## pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.5.1)
## pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
## pkgload 1.0.1 2018-10-11 [1] CRAN (R 3.5.1)
## plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.0)
## prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.5.0)
## processx 3.2.0 2018-08-16 [1] CRAN (R 3.5.1)
## ps 1.2.0 2018-10-16 [1] CRAN (R 3.5.1)
## purrr * 0.2.5 2018-05-29 [1] CRAN (R 3.5.0)
## R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
## Rcpp 0.12.19 2018-10-01 [1] CRAN (R 3.5.1)
## readr * 1.1.1 2017-05-16 [1] CRAN (R 3.5.0)
## readxl 1.1.0 2018-04-20 [1] CRAN (R 3.5.0)
## remotes 2.0.1 2018-10-19 [1] CRAN (R 3.5.1)
## rlang 0.3.0 2018-10-22 [1] CRAN (R 3.5.1)
## rmarkdown 1.10 2018-06-11 [1] CRAN (R 3.5.0)
## rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.5.0)
## rstudioapi 0.8 2018-10-02 [1] CRAN (R 3.5.1)
## rvest 0.3.2 2016-06-17 [1] CRAN (R 3.5.0)
## scales * 1.0.0 2018-08-09 [1] CRAN (R 3.5.1)
## sessioninfo 1.1.0 2018-09-25 [1] CRAN (R 3.5.1)
## stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1)
## stringr * 1.3.1 2018-05-10 [1] CRAN (R 3.5.0)
## survival * 2.42-6 2018-07-13 [1] CRAN (R 3.5.1)
## survminer * 0.4.3 2018-08-04 [1] CRAN (R 3.5.1)
## survMisc 0.5.5 2018-07-05 [1] CRAN (R 3.5.1)
## testthat 2.0.1 2018-10-13 [1] CRAN (R 3.5.1)
## tibble * 1.4.2 2018-01-22 [1] CRAN (R 3.5.0)
## tidyr * 0.8.1 2018-05-18 [1] CRAN (R 3.5.0)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.1)
## tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.5.0)
## usethis * 1.4.0 2018-08-14 [1] CRAN (R 3.5.1)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 3.5.0)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
## xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.0)
## xtable 1.8-3 2018-08-29 [1] CRAN (R 3.5.1)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
## zoo 1.8-4 2018-09-19 [1] CRAN (R 3.5.1)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/lib/R/site-library
## [3] /usr/lib/R/library